In this case study we'll build a better understanding of the Human Capital Factor by exploring the input data and calculating the factor for a small sample of companies.
Key takeaways:
Begin by loading all packages used and defining the company universe for this exercise, which will be the same companies as in the survey.
# load functions from irrational capital production code package
from irrcap.ETL._util import append_source_files
from irrcap.ETL._ecdp import simplify_respondent_flags
from irrcap.loader import load_ec, load_gd, load_sm, mask_ec_statements, load_ret
from irrcap.ETL._glassdoor_api import get_glassdoor_data
from irrcap.ETL._glassdoor import fix_na_values
from irrcap.products import hcf_score
# load other functions
import plotly.express as px # plotting
import logging
import s3fs # connect to AWS
import pandas as pd
import random
logger = logging.getLogger(name="deep-fin-demo")
logger.setLevel(logging.DEBUG)
companies = {'GOOG': 1500297,
'AMZN': 129769,
'GM': 525728,
'F': 136149,
'WMT': 128310,
'TGT': 142688,
'CMCSA': 141020,
'DISH': 141597,
'KO': 140623,
'PEP': 132785,
'HD': 144225,
'LOW': 137821,
'NVDA': 146481,
'AMD': 144161,
'RCL': 143310,
'CCL': 133543,
'ODFL': 130943,
'WERN': 126033,
'WD': 529468,
'NYCB': 131030,
'DLR': 137967,
'PSA': 139880,
'GPK': 140475,
'AVY': 126237,
'DXC': 130451,
'DAVA': 2417454,
'TREE': 406454,
'GDOT': 521547,
'AAP': 145579,
'PAG': 127560,
'CCO': 181652,
'IPG': 133566}
likert = {0:'Strongly Disagree',1:'Disagree',2:'Somewhat Disagree',3:'Neither Agree nor Disagree',
          4:'Somewhat Agree',5:'Agree',6:'Strongly Agree'}
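As a quick standalone sketch (toy responses, not the production pipeline), this mapping can be applied to a column of numeric responses with `Series.map`:

```python
import pandas as pd

likert = {0: 'Strongly Disagree', 1: 'Disagree', 2: 'Somewhat Disagree',
          3: 'Neither Agree nor Disagree', 4: 'Somewhat Agree',
          5: 'Agree', 6: 'Strongly Agree'}

# map numeric Likert codes onto their text labels
responses = pd.Series([5, 2, 6])
labels = responses.map(likert)
print(labels.tolist())  # ['Agree', 'Somewhat Disagree', 'Strongly Agree']
```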
With our notebook set up, we'll now explore the data exactly as it is provided by our private data partner.
# load all source files and combine into one dataframe called ecs
ecs = append_source_files('s3://irrcap-data/deep-fin/ecdp/source/Incremental/',s3fs.S3FileSystem(), logger=logger)
ecs = ecs.reset_index(drop=True)
ecs.to_parquet('s3://irrcap-data/deep-fin/ecdp/ecdp_incremental_source.pq') # write that dataframe to the data lake for posterity
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:deep-fin-demo:Appending 12 files from s3://irrcap-data/deep-fin/ecdp/source/Incremental/.
ecs = mask_ec_statements(ecs) # rename certain columns to aid in masking the identity of our data partner
ecs.columns
Index(['HASH', 'SURVEYRESPONDERID', 'STOCK_SYMBOL', 'COMPANYID', 'COMPANYNAME',
'AVERAGE_RESPONSE_DATE', 'DEPARTMENT', 'PARENT_DEPARTMENT',
'TOP_PARENT_DEPARTMENT', 'JOBGRADEID', 'TENUREID', 'SALARYBANDID',
'EMPLOYMENTTYPEID', 'GENDER_REPORTED', 'GENDER_PREDICTED',
'NATIONAL_EMPLOYEE_COUNT', 'OrgAlignment1', 'DirectMgmt1',
'OrgEffectiveness1', 'Extrinsic1', 'Extrinsic2', 'EmotionalConn1',
'Extrinsic3', 'OrgAlignment2', 'OrgEffectiveness2', 'OrgAlignment3',
'Engagement1', 'OrgAlignment4', 'DirectMgmt2', 'OrgAlignment5',
'EmotionalConn2', 'Engagement2', 'Engagement3', 'EmotionalConn3',
'Engagement4', 'DirectMgmt3', 'EmotionalConn4', 'DirectMgmt4',
'Innovation1', 'EmotionalConn5', 'OrgAlignment6', 'Innovation2',
'DirectMgmt5', 'DirectMgmt6', 'Engagement5', 'DirectMgmt7',
'EmotionalConn6', 'DirectMgmt8', 'EmotionalConn7', 'Engagement6',
'OrgEffectiveness3', 'OrgAlignment7', 'CW1', 'CW2', 'CW3',
'OrgAlignment1_ex', 'DirectMgmt1_ex', 'OrgEffectiveness1_ex',
'Extrinsic1_ex', 'Extrinsic2_ex', 'EmotionalConn1_ex', 'Extrinsic3_ex',
'OrgAlignment2_ex', 'OrgEffectiveness2_ex', 'OrgAlignment3_ex',
'Engagement1_ex', 'OrgAlignment4_ex', 'DirectMgmt2_ex',
'OrgAlignment5_ex', 'EmotionalConn2_ex', 'Engagement2_ex',
'Engagement3_ex', 'EmotionalConn3_ex', 'Engagement4_ex',
'DirectMgmt3_ex', 'EmotionalConn4_ex', 'DirectMgmt4_ex',
'Innovation1_ex', 'EmotionalConn5_ex', 'OrgAlignment6_ex',
'Innovation2_ex', 'DirectMgmt5_ex', 'DirectMgmt6_ex', 'Engagement5_ex',
'DirectMgmt7_ex', 'EmotionalConn6_ex', 'DirectMgmt8_ex',
'EmotionalConn7_ex', 'Engagement6_ex', 'OrgEffectiveness3_ex',
'OrgAlignment7_ex', 'source_file'],
dtype='object')
Group the various columns into categories
# uninteresting ID columns
id_cols = ['HASH', 'SURVEYRESPONDERID','AVERAGE_RESPONSE_DATE']
# columns that reveal information about the underlying companies (proprietary)
company_cols =['STOCK_SYMBOL', 'COMPANYID', 'COMPANYNAME', 'DEPARTMENT', 'PARENT_DEPARTMENT',
'TOP_PARENT_DEPARTMENT','NATIONAL_EMPLOYEE_COUNT']
# columns that capture the demographics of an individual respondent
demo_cols = ['JOBGRADEID', 'TENUREID', 'SALARYBANDID','EMPLOYMENTTYPEID', 'GENDER_REPORTED', 'GENDER_PREDICTED']
# columns that contain the Likert-scale responses to each of the survey statements
statement_cols = list(ecs.columns[ecs.columns.str.match(r'^(?!CW).+\d$')])
# columns used by our data partner to flag certain respondents
flag_cols = list(ecs.columns[ecs.columns.str.match(r'.+_ex$')])
# columns that contain text information provided by respondents
text_cols = ['CW1','CW2','CW3']
# column added by us when we combined all the source files.
other_cols = ['source_file']
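The two regex selections above can be illustrated on a small, hypothetical handful of column names (just a subset for demonstration):

```python
import pandas as pd

cols = pd.Index(['HASH', 'OrgAlignment1', 'CW1', 'Engagement2',
                 'OrgAlignment1_ex', 'source_file'])

# statement columns end in a digit; the (?!CW) lookahead excludes the CW text columns
statement_cols = list(cols[cols.str.match(r'^(?!CW).+\d$')])
# respondent-flag columns all carry the '_ex' suffix
flag_cols = list(cols[cols.str.match(r'.+_ex$')])

print(statement_cols)  # ['OrgAlignment1', 'Engagement2']
print(flag_cols)       # ['OrgAlignment1_ex']
```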
Look at an example row to better understand what the data looks like.
pd.set_option('display.max_colwidth',75)
pd.set_option('display.max_rows',100)
ecs[id_cols+demo_cols+statement_cols+text_cols].sample(1,random_state=34).T
| 2895 | |
|---|---|
| HASH | AxwSam0fl3dJxdBqlgt15sLASFkMVF2X3pHAKIyaHsXfavZyHyEe0SbBSYUrXDMEZjzrgbp... |
| SURVEYRESPONDERID | 12274238 |
| AVERAGE_RESPONSE_DATE | 2018-09-16 |
| JOBGRADEID | 2.0 |
| TENUREID | 6.0 |
| SALARYBANDID | 5.0 |
| EMPLOYMENTTYPEID | 1.0 |
| GENDER_REPORTED | NaN |
| GENDER_PREDICTED | Male |
| OrgAlignment1 | 5.0 |
| DirectMgmt1 | NaN |
| OrgEffectiveness1 | 5.0 |
| Extrinsic1 | 5.0 |
| Extrinsic2 | 5.0 |
| EmotionalConn1 | NaN |
| Extrinsic3 | 2.0 |
| OrgAlignment2 | NaN |
| OrgEffectiveness2 | 3.0 |
| OrgAlignment3 | 5.0 |
| Engagement1 | 3.0 |
| OrgAlignment4 | 5.0 |
| DirectMgmt2 | NaN |
| OrgAlignment5 | 4.0 |
| EmotionalConn2 | NaN |
| Engagement2 | 5.0 |
| Engagement3 | 4.0 |
| EmotionalConn3 | NaN |
| Engagement4 | 5.0 |
| DirectMgmt3 | NaN |
| EmotionalConn4 | NaN |
| DirectMgmt4 | NaN |
| Innovation1 | 2.0 |
| EmotionalConn5 | NaN |
| OrgAlignment6 | 5.0 |
| Innovation2 | 3.0 |
| DirectMgmt5 | 5.0 |
| DirectMgmt6 | 5.0 |
| Engagement5 | 5.0 |
| DirectMgmt7 | 5.0 |
| EmotionalConn6 | 3.0 |
| DirectMgmt8 | NaN |
| EmotionalConn7 | NaN |
| Engagement6 | 3.0 |
| OrgEffectiveness3 | 4.0 |
| OrgAlignment7 | 4.0 |
| CW1 | Friendly |
| CW2 | progressive |
| CW3 | accurate |
Now we'll run our process that simplifies these flags down to the key information.
ecs = simplify_respondent_flags(ecs)
ecs.groupby(['Flag1','Flag2']).size()
INFO:ecdp-etl:Simplifying respondent flags.
DEBUG:ecdp-etl:Shape before dropping flags: (277937, 92). Shape after dropping flags: (277937, 58).
Flag1 Flag2
False False 277888
True 20
True False 18
True 11
dtype: int64
With an understanding of our data as provided by the data partner, it's now time to run the remaining cleaning steps to get the data ready to be loaded by the model.
# remove source file column
ecc = ecs.drop(columns = 'source_file')
# cast companyid as a string (needed to join to our company master list)
ecc['COMPANYID'] = ecc.COMPANYID.astype(str)
# add Survey Year and Trade Year definitions
ecc['AVERAGE_RESPONSE_DATE'] = pd.to_datetime(ecc['AVERAGE_RESPONSE_DATE'], errors = 'coerce')
ecc['SurveyYear'] = ecc['AVERAGE_RESPONSE_DATE'].dt.year
ecc['TradeYear'] = ecc['SurveyYear'] + 1
# write back to data lake
ecc.to_parquet('s3://irrcap-data/deep-fin/ecdp/ecdp.pq')
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
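The `SurveyYear`/`TradeYear` convention (trade on the year after the survey, with unparseable dates coerced to `NaT`) can be sketched on toy data:

```python
import pandas as pd

df = pd.DataFrame({'AVERAGE_RESPONSE_DATE': ['2018-09-16', 'not a date']})
df['AVERAGE_RESPONSE_DATE'] = pd.to_datetime(df['AVERAGE_RESPONSE_DATE'], errors='coerce')
df['SurveyYear'] = df['AVERAGE_RESPONSE_DATE'].dt.year  # NaN where the date failed to parse
df['TradeYear'] = df['SurveyYear'] + 1                  # returns are measured the following year
print(df[['SurveyYear', 'TradeYear']])
```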
With the clean data written to the data lake, now we'll re-load it using our loader function
# use structured data import function
ec = load_ec(ENV='deep-fin', # load from which data warehouse environment
source=False, # use source data or clean data
cache=False, # use local cache (if exists?)
drop_incomplete=False, # only include statements with complete history
drop_flagged=True, # drop rows flagged as bad reader or unthoughtful
add_features=True, # add additional feature columns used in other research
add_keys=True # add keys to demographic columns
)
#ec.columns[~ec.columns.isin(ecc.columns)]
The private data is now loaded exactly as it is needed to calculate the Human Capital Factor Score. With our data analysis-ready, we can do some exploration to see what exactly is in this data.
fig = px.bar(ec[statement_cols].count().sort_index(), title = 'Number of Responses Per Statement')
fig.update_layout(showlegend = False)
figdat = (ec.melt(value_vars = statement_cols)
.rename(columns = {'value':'Response','variable':'Total Responses'})
.groupby('Response')
.count()
.rename(index=likert))
fig = px.bar(figdat,y = 'Total Responses',title = 'Distribution of All Responses Across the Likert Scale',height = 400)
fig.update_layout(showlegend = False)
ecm = ec.melt(id_vars = 'AssetID',value_vars = statement_cols)
ecm = ecm.rename(columns = {'value':'Response','variable':'Statement'})
ecm['Statement2'] = ecm['Statement'].apply(lambda x: x[:-1])
figdat2 = (ecm
.groupby(['Statement2','Response'])
.size()
.rename('N')
.to_frame()
.reset_index()
.sort_values('Statement2'))
tots = figdat2.groupby('Statement2')['N'].sum().rename('Tot')
figdat2 = figdat2.join(tots,on='Statement2')
figdat2['pct'] = figdat2['N']/figdat2['Tot']
fig = px.bar(figdat2,x = 'Response',y = 'pct',facet_col = 'Statement2',facet_col_wrap = 4,height = 800)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.show()
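The `pct` calculation above follows a common pandas pattern: join per-group totals back onto the rows, then divide. A toy sketch (made-up statements and counts):

```python
import pandas as pd

toy = pd.DataFrame({'Statement2': ['Engagement', 'Engagement', 'Innovation'],
                    'Response': [4, 5, 5],
                    'N': [30, 70, 10]})
tots = toy.groupby('Statement2')['N'].sum().rename('Tot')
toy = toy.join(tots, on='Statement2')  # broadcast each group's total back onto its rows
toy['pct'] = toy['N'] / toy['Tot']     # within-statement share of responses
print(toy['pct'].tolist())  # [0.3, 0.7, 1.0]
```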
plotdat = (ecm
.groupby(['AssetID','Statement2'])
['Response'].mean()
.reset_index()
.pivot(index='AssetID',columns = 'Statement2', values='Response')
.corr()
)
#ec[sorted(statement_cols)].dropna(axis = 0, how='all').dropna(axis = 1, how = 'all').corr()
for i,r in enumerate(plotdat.index):
for j,c in enumerate(plotdat.columns):
if i < j:
plotdat.loc[r,c] = None
fig = px.imshow(plotdat
,height = 800,color_continuous_scale='Portland'
,title = 'Correlation Between Statement Responses')
fig.update_layout(xaxis_title=None, yaxis_title=None)
fig.show()
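The nested loop above blanks out the strictly upper triangle so the heatmap shows each pair of statements only once. An equivalent vectorized alternative (a sketch on a toy correlation matrix, not the notebook's code) uses `numpy.triu` with `DataFrame.mask`:

```python
import numpy as np
import pandas as pd

corr = pd.DataFrame([[1.0, 0.5, 0.2],
                     [0.5, 1.0, 0.4],
                     [0.2, 0.4, 1.0]],
                    index=list('abc'), columns=list('abc'))

# NaN out the strictly upper triangle (k=1 keeps the diagonal)
masked = corr.mask(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(masked)
```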
The proprietary data also has a wealth of demographic information. Each respondent is identified by their tenure, salary, management level (job grade), full-time status, and gender.
plotdat_parts = []
ec['Salary band'] = ec['Salary band'].str.replace('$','',regex=False).str.replace(' - ','-',regex=False)
for d in ['Job grade','Tenure band','Salary band','Employment type','Gender']:
    plotdat_parts.append(ec.groupby(d).size().rename('N').reset_index().melt(id_vars = 'N'))
plotdat = pd.concat(plotdat_parts) # DataFrame.append was removed in pandas 2.0, so use pd.concat
fig = px.bar(plotdat,y='N',x='variable',color='value',text='value',height = 800
,title = 'Distribution of Respondents for Each Demographic Category'
,category_orders = {'value': ['Less than 30,000','30,000-50,000','50,000-100,000'
,'More than 100,000'
,'Less than 6 months','6 to 12 months','1 to 3 years'
,'3 to 5 years','5 to 10 years','10 to 15 years'
,'More than 15 years'
,'Team member','Manager','Senior manager']})
fig.update_layout(showlegend=False,xaxis_title=None,yaxis_title=None)
fig.show()
For part 2 we encourage interested participants to follow along and do their own exploration using the provided participant notebook.
In this section we'll get a better understanding of the public data, then we'll look at financial returns for a naive strategy using the public data.
Begin by loading the analysis-ready data using our loader functions.
gd0 = load_gd(ENV='deep-fin', # load from which data warehouse environment
source=False, # use source data or clean data
cache=False, # use local cache (if exists)
ratings_only=True, # only load ratings and company ID columns
)
gd0['TradeYear'] = gd0['SurveyYear']+1
# write to csv (then commit to github) so the data can be used for the participant notebook
gd0.to_csv('glassdoor_DeepFin_sample.csv',index=False)
# read data back in from github
gd = pd.read_csv('https://raw.githubusercontent.com/IrrationalCapital/DeepFin/main/glassdoor_DeepFin_sample.csv')
# get list of all columns
gd.columns
Index(['Ticker Symbol', 'Ticker', 'Exchange', 'As Of Date', 'SurveyYear',
'Rating: Overall', 'Rating: Work/Life Balance',
'Rating: Culture & Values', 'Rating: Career Opportunities',
'Rating: Comp & Benefits', 'Rating: Senior Management',
'Ticker Symbol_join', 'AssetID', 'TradeYear'],
dtype='object')
pd.set_option('display.max_rows',75)
gd.sample(1,random_state = 34).T
| 103467 | |
|---|---|
| Ticker Symbol | nyse:acn |
| Ticker | ACN |
| Exchange | NYSE |
| As Of Date | 2018-06-11 00:00:00+00:00 |
| SurveyYear | 2018.0 |
| Rating: Overall | 5.0 |
| Rating: Work/Life Balance | 5.0 |
| Rating: Culture & Values | 5.0 |
| Rating: Career Opportunities | 4.0 |
| Rating: Comp & Benefits | 3.0 |
| Rating: Senior Management | 4.0 |
| Ticker Symbol_join | nyse:acn |
| AssetID | 134666 |
| TradeYear | 2019.0 |
Now we'll explore the public data in a similar way to what we did with the private data.
rating_cols = gd.columns[gd.columns.str.match('Rating:')]
fig = px.bar(gd[rating_cols].count(), title = 'Number of Ratings Per Category',height = 400)
fig.update_layout(showlegend = False,xaxis_title=None,yaxis_title=None)
fig.show()
figdat3 = (gd.reset_index()
.melt(value_vars = rating_cols)
.rename(columns = {'value':'Rating','variable':'Total Responses'})
.groupby('Rating')
.count())
fig = px.bar(figdat3,y = 'Total Responses',title = 'Distribution of Glassdoor Ratings')
fig.update_layout(showlegend = False,xaxis_title=None)
fig.show()
figdat4 = (gd.melt(value_vars = rating_cols)
.rename(columns = {'value':'Rating','variable':'Category'})
.groupby(['Category','Rating'])
.size()
.rename('Total Responses')
.to_frame()
.reset_index())
fig = px.bar(figdat4,x = 'Rating',y = 'Total Responses',facet_col = 'Category',facet_col_wrap = 3,height = 500,
title = 'Rating Distribution by Category')
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.show()
plotdat = gd[sorted(rating_cols)].dropna(axis = 0, how='all').dropna(axis = 1, how = 'all').corr()
for i,r in enumerate(plotdat.index):
for j,c in enumerate(plotdat.columns):
if i < j:
plotdat.loc[r,c] = None
px.imshow(plotdat
,height = 800,color_continuous_scale='Portland'
,title = 'Correlation Between Ratings')
Next we'll bring in the return stream for the public data and test a naive strategy based on this data.
# prepare data for the participant notebook
r = load_ret()
r = r.loc[r.AssetID.isin(gd.AssetID) & (r.Date.dt.year == 2019),['Date','Close','AssetID']]
r.to_csv('glassdoor_DeepFin_returns.csv',index=False)
# read data back in from github
ret = pd.read_csv('https://raw.githubusercontent.com/IrrationalCapital/DeepFin/main/glassdoor_DeepFin_returns.csv')
ret.head(10)
| Date | Close | AssetID | |
|---|---|---|---|
| 0 | 2019-01-02 | 37.38316 | 124968 |
| 1 | 2019-01-03 | 37.91329 | 124968 |
| 2 | 2019-01-04 | 38.22966 | 124968 |
| 3 | 2019-01-07 | 38.14416 | 124968 |
| 4 | 2019-01-08 | 39.23008 | 124968 |
| 5 | 2019-01-09 | 38.89661 | 124968 |
| 6 | 2019-01-10 | 39.62341 | 124968 |
| 7 | 2019-01-11 | 39.85427 | 124968 |
| 8 | 2019-01-14 | 39.46950 | 124968 |
| 9 | 2019-01-15 | 40.00819 | 124968 |
def run_strategy(ret, top_asset_ids):
# weight everything based on day 0
ret = ret.merge(ret.loc[ret.Date == '2019-01-02',['AssetID','Close']],on='AssetID',suffixes = ('','_0'))
ret['Close2'] = ret['Close']/ret['Close_0']
# define top and bottom half
ret['strat'] = 'Bottom Half'
ret.loc[ret.AssetID.isin(top_asset_ids),'strat'] = 'Top Half'
# get daily pct change for each aggregated strat group then calculate daily returns vs day0
ret1 = ret.groupby(['strat','Date'])['Close2'].sum().reset_index().sort_values('Date')
ret1['DayChange'] = (ret1.groupby('strat')['Close2'].pct_change() + 1).fillna(1)
ret1['Return'] = (ret1.groupby('strat')['DayChange'].cumprod())-1
return ret1
gdm = gd.melt(id_vars = ['AssetID'], value_vars = gd.columns[gd.columns.str.match('Rating')])
ar = gdm.groupby(['AssetID'])['value'].agg(['mean'])
ar['rank'] = ar['mean'].rank(ascending = False,pct = True)
ar = ar.reset_index()
ar.sort_values('rank')
ro_top = ar.loc[ar['rank'] <= .5,'AssetID']
strat1 = run_strategy(ret,ro_top)
strat1 = strat1.rename(columns = {'strat':'Mean Overall Rating'})
fig = px.line(strat1,x='Date',y='Return',color = 'Mean Overall Rating'
,title = 'Performance of Top Half vs Bottom Half by Overall Rating')
fig.update_layout(yaxis_tickformat=',.0%',xaxis_title=None
,legend={'orientation':"h",'yanchor':"bottom", 'y':-.35, 'xanchor':'center', 'x':.5})
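To sanity-check the mechanics of `run_strategy`, here is a self-contained toy run (the function is copied from above; the two synthetic assets and their prices are made up): one asset gains 10% on the day and the other loses 10%, so the top half should return +10% and the bottom half -10%.

```python
import pandas as pd

# copy of run_strategy from above so this sketch runs on its own
def run_strategy(ret, top_asset_ids):
    # weight everything based on day 0
    ret = ret.merge(ret.loc[ret.Date == '2019-01-02', ['AssetID', 'Close']],
                    on='AssetID', suffixes=('', '_0'))
    ret['Close2'] = ret['Close'] / ret['Close_0']
    # define top and bottom half
    ret['strat'] = 'Bottom Half'
    ret.loc[ret.AssetID.isin(top_asset_ids), 'strat'] = 'Top Half'
    # daily pct change per strat group, then cumulative return vs day 0
    ret1 = ret.groupby(['strat', 'Date'])['Close2'].sum().reset_index().sort_values('Date')
    ret1['DayChange'] = (ret1.groupby('strat')['Close2'].pct_change() + 1).fillna(1)
    ret1['Return'] = ret1.groupby('strat')['DayChange'].cumprod() - 1
    return ret1

toy = pd.DataFrame({'Date': ['2019-01-02', '2019-01-03'] * 2,
                    'Close': [100.0, 110.0, 50.0, 45.0],
                    'AssetID': [1, 1, 2, 2]})
out = run_strategy(toy, top_asset_ids=[1])
final = out.groupby('strat')['Return'].last()
print(final)  # Top Half ~ +10%, Bottom Half ~ -10%
```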
With all of our analysis-ready data loaded, the final step is to run the HCF process and look at the results. For simplicity, we'll keep the process to the 32 companies from our earlier survey.
# reload glassdoor data with all companies to calculate HCF score
gd2 = load_gd(ENV = 'prod',ratings_only=True)
gd2['TradeYear'] = gd2['SurveyYear']+1
ec_sample = ec.loc[ec.AssetID.isin(list(companies.values()))].copy() # filter to companies in survey
gd_sample = gd2.loc[gd2.AssetID.isin(list(companies.values()))].copy() # filter public data
scores_df = hcf_score(ec_sample, gd_sample, TradeYear=2019, ec_q=statement_cols, gd_q=list(rating_cols))
INFO:hcf_scores.alpha_blending:Calculating Human Capital Factor Scores for TradeYear 2019 using Alpha Blending.
INFO:hcf_scores.alpha_blending:ECDP data has initial shape (9600, 38) and glassdoor data has initial shape (43788, 8)
INFO:hcf_scores.alpha_blending:After filtering, 7 companies from ECDP will be used to calculate HCF.
INFO:hcf_scores.alpha_blending:31 companies from glassdoor will be used to calculate HCF.
C:\Users\pmelg\anaconda3\lib\site-packages\fancyimpute\solver.py:55: UserWarning: Input matrix is not missing any values
INFO:hcf_scores.alpha_blending:Running Alpha Blending
INFO:hcf_scores.alpha_blending:HCF successfully calculated for Trade Year 2019 / Threshold 25 with 32 companies in result.
scores_df = scores_df.join(pd.DataFrame({'Company':companies.keys()},index=companies.values()),on='AssetID')
#scores_df['HCFDecile'] = scores_df['HCFDecile'].quantile(.1)
scores_df['HCFScore2'] = (scores_df.HCFScore-scores_df.HCFScore.min())/(scores_df.HCFScore.max()-scores_df.HCFScore.min())
scores_df[['AssetID','Company','HCFScore2']].sort_values('HCFScore2',ascending=False).reset_index(drop=True)
| AssetID | Company | HCFScore2 | |
|---|---|---|---|
| 0 | 1500297 | GOOG | 1.000000 |
| 1 | 137967 | DLR | 0.968567 |
| 2 | 146481 | NVDA | 0.889265 |
| 3 | 2417454 | DAVA | 0.876513 |
| 4 | 136149 | F | 0.726040 |
| 5 | 529468 | WD | 0.721217 |
| 6 | 144161 | AMD | 0.710634 |
| 7 | 143310 | RCL | 0.689253 |
| 8 | 130943 | ODFL | 0.686114 |
| 9 | 129769 | AMZN | 0.663586 |
| 10 | 140623 | KO | 0.658519 |
| 11 | 525728 | GM | 0.651586 |
| 12 | 128310 | WMT | 0.650479 |
| 13 | 133566 | IPG | 0.642027 |
| 14 | 406454 | TREE | 0.574135 |
| 15 | 144225 | HD | 0.544100 |
| 16 | 181652 | CCO | 0.518664 |
| 17 | 126237 | AVY | 0.494724 |
| 18 | 132785 | PEP | 0.485166 |
| 19 | 133543 | CCL | 0.483914 |
| 20 | 137821 | LOW | 0.464362 |
| 21 | 142688 | TGT | 0.450440 |
| 22 | 145579 | AAP | 0.290965 |
| 23 | 127560 | PAG | 0.286532 |
| 24 | 139880 | PSA | 0.244325 |
| 25 | 141020 | CMCSA | 0.229778 |
| 26 | 126033 | WERN | 0.227274 |
| 27 | 140475 | GPK | 0.220946 |
| 28 | 131030 | NYCB | 0.212043 |
| 29 | 141597 | DISH | 0.190368 |
| 30 | 130451 | DXC | 0.105846 |
| 31 | 521547 | GDOT | 0.000000 |
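The `HCFScore2` column above is a plain min-max normalization: the highest raw score maps to 1 and the lowest to 0. A minimal sketch with made-up scores:

```python
import pandas as pd

scores = pd.Series([2.5, -1.0, 0.5])  # hypothetical raw HCF scores
scaled = (scores - scores.min()) / (scores.max() - scores.min())
print(scaled.tolist())  # highest -> 1.0, lowest -> 0.0
```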